Analysis and Development of Urdu POS Tagged Corpus

Authors

  • Ahmed Muaz
  • Aasim Ali
  • Sarmad Hussain
Abstract

In this paper, two Urdu corpora (of 110K and 120K words), tagged with different POS tagsets, are used to train the TnT and Tree taggers. An error analysis of both taggers is carried out to identify frequent tagging confusions. Based on this analysis and on the syntactic structure of Urdu, a more refined tagset is derived. The existing tagged corpora are re-tagged with the new tagset to form a single corpus of 230K words, and the TnT tagger is retrained. The results show that tagging accuracy improves to 94.2% for the individual corpora and to 91% for the merged corpus. Implications of these results are discussed.
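The error analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes the gold-standard and tagger-assigned tags are available as two parallel lists. The function names and the toy tags below are hypothetical and are not taken from the paper or its tagsets.

```python
from collections import Counter


def tagging_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def frequent_confusions(gold, predicted, top_n=5):
    """Return the most common (gold_tag, predicted_tag) pairs among errors."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return errors.most_common(top_n)


# Toy data with made-up tags (illustrative only, not the paper's tagset).
gold = ["NN", "VB", "ADJ", "NN", "PN", "ADJ"]
predicted = ["NN", "VB", "NN", "NN", "NN", "NN"]

accuracy = tagging_accuracy(gold, predicted)
confusions = frequent_confusions(gold, predicted, top_n=2)
print(f"accuracy = {accuracy:.2f}")
for (gold_tag, predicted_tag), count in confusions:
    print(f"{gold_tag} tagged as {predicted_tag}: {count}")
```

Ranking the confused pairs by frequency is one simple way to see which tag distinctions a tagger handles poorly. The paper may have used a different procedure.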

Related resources

Automated part-of-speech analysis of Urdu: conceptual and technical issues

Part-of-speech (POS) tagging is the process of labelling tokens in a text with tags that indicate their morphosyntactic category, and has a wide range of applications in computational and corpus linguistics, such as the production of corpus-based dictionaries and grammars. This paper describes an experiment in extending POS tagging to a hitherto untagged language, Urdu. The most challenging tas...

A hybrid approach to Urdu verb phrase chunking

A variety of verb phrases exist in Urdu, including simple verb phrases, conjunct verb phrases and compound verb phrases. This paper explains the structure of Urdu verb phrases, and details a series of experiments to automatically tag them. Initially, a rule-based model is developed using 21 linguistic rules for automatic VP chunking. A 100,000 word Urdu corpus is manually tagged with VP chunk tag...

A Data-Driven Dependency Parser for Urdu

One of the main motivations for building treebanks is that they facilitate the development of syntactic parsers, by providing realistic data for evaluation as well as inductive learning. In this paper we present what we believe to be the first data-driven dependency parser for Urdu, which has been developed using the MaltParser system and trained and evaluated on data from the Urdu dependency treebank. A 40...

Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank

This work aims at the development of a representative treebank for the South Asian language Urdu. Urdu is a comparatively under-resourced language, and the development of a reliable treebank for Urdu will have a significant impact on the state of the art for Urdu language processing. In the URDU.KON-TB treebank described here, a POS tagset, a syntactic tagset and a functional tagset have been proposed...

Morphological Ending-based Strategies of Unknown Word Estimation for Statistical POS Urdu Tagger

Natural language processing has widely used statistical language models to solve disambiguation problems. Over the past decades, different POS tagging techniques have been proposed for English, European and East Asian languages. In this paper our focus is POS tagging for Urdu, because Urdu tagging systems are still in their infancy. We have combined two approaches (Statisti...

Publication date: 2009